Abstract 

The Supercomputer Toolkit is a proposed family of standard hard- 
ware and software components from which special-purpose machines 
can be easily configured. Using the Toolkit, a scientist or an engineer, 
starting with a suitable computational problem, will be able to readily 
configure a special purpose multiprocessor that attains supercomputer- 
class performance on that problem, at a fraction of the cost of a general 
purpose supercomputer. 

The Toolkit is currently being built as a joint project between 
Hewlett-Packard and MIT. The software and the applications are in 
various stages of development and research. 
The Supercomputer Toolkit and its Applications 
1 Introduction 

The Supercomputer Toolkit is a proposed family of standard hardware and 
software components from which special-purpose machines can be easily con- 
figured. Using the Toolkit, a scientist or an engineer, starting with a suitable 
computational problem, will be able to readily configure a special-purpose 
multiprocessor that attains supercomputer-class performance on that prob- 
lem, at a fraction of the cost of a general-purpose supercomputer. 

Each type of Toolkit hardware module will be implemented as an individ- 
ual board. The boards fit into a common chassis that furnishes only power 
and ground. Special cables are used to achieve high-speed communication 
among boards and to distribute the clock. A user assembles a machine by 
plugging in the required modules and connecting the cables appropriately. 
When a particular machine is no longer needed, it can be disassembled, 
and its modules can be reassembled into other configurations. As of June, 
1990, we have designed, fabricated, and are beginning to benchmark the ba- 
sic Toolkit processor module, tailored for high-performance double-precision 
floating-point operations. A typical configuration will include several proces- 
sor modules. Other hardware modules we hope to develop will provide for 
mass memory and high-speed data-acquisition. 

The intent of this arrangement is to make it simple and relatively in- 
expensive to configure special-purpose computational engines. Yet even if 
appropriate hardware modules were readily available, these would not be 
of much use if programming each new machine entailed a major software- 
development effort, or required an intense analysis to exploit the available 
parallelism effectively. We believe that, for suitable scientific-computing ap- 
plications, one can compile extremely high-performance code from high-level 
languages, and moreover, that the compiler can automatically synthesize a 
pattern of interconnection well-matched to the program being compiled, as 
well as automatically schedule the computation to make effective use of the 
available parallelism. In addition to this novel compiler, the software support 
for the Toolkit will include an assembler, a simulator, and debugging tools. 
There will also be standard software components, such as a scientific library, 
for inclusion in Toolkit programs. 



We envision that the Toolkit will be used as follows: 

One begins with an algorithm that performs the costly inner loop of a 
computation that is important enough to warrant constructing a special- 
purpose machine. For example, the simulation part of a multidimensional 
optimization in the computer-aided design of an analog circuit, or the inte- 
gration of the differential equations required to achieve the real-time control 
of a nonlinear process, are appropriate for Toolkit implementation. 

The Toolkit software will be used to compile the program, targeted for 
a number of different Toolkit hardware configurations, some proposed by 
the user, others generated automatically by the Toolkit compiler itself. The 
compiler will also produce, for each configuration, a simulation that the user 
can run on the host machine to help evaluate price-performance tradeoffs. 
After a configuration has been selected, the user will obtain the required 
modules, wire them together, and connect the machine he has built to a host 
computer. The configuration will be verified by means of diagnostics that 
are automatically generated and loaded from the host. The target program 
will then be loaded, and the new machine will be ready to be used by host 
programs as a back-end processor. 

2 Historical Motivation 

The Digital Orrery [2], constructed in 1983-1984, is a special-purpose numer- 
ical engine optimized for high-precision numerical integrations of the equa- 
tions of motion of small numbers of gravitationally interacting bodies. Using 
1980 technology, the device is about 1 cubic foot of electronics, dissipating 
150 watts. On the problem it was designed to solve, it was measured to be 
60 times faster than a VAX 11/780 with FPA, or 1/3 the speed of a Cray 1S. 
The Orrery achieves this performance at modest cost for two reasons. 
Its communication paths are specialized for the solar-system problem. It is 
organized as a ring of up to ten processing elements, one for each body to 
be simulated. The algorithm passes the states of the n bodies around the 
ring, allowing the computation of all n^2 accelerations in order n time, with 
negligible communication cost. Additionally, the program that performs the 
integration completely exploits the data-independence that is inherent in the 
problem. All available cycles are used for floating-point operations; none are 
used to support data-structure references. 
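
The ring scheme can be illustrated with a short Python simulation. This is
a sketch under stated assumptions, not the Orrery's microcode: the function
name ring_accelerations, the unit gravitational constant, and the use of
tuples for 3-vectors are all chosen for the example. Each simulated
processing element keeps one body and, at every step, receives a visiting
state from its ring neighbor and accumulates that body's pull. After n - 1
shifts every element has seen every other body once.

```python
import math

def ring_accelerations(masses, positions, G=1.0):
    """Simulate n ring-connected elements computing pairwise gravity.

    Element i permanently holds body i. A copy of every body's state
    circulates around the ring; after n - 1 shifts each element has
    seen every other body exactly once.
    """
    n = len(masses)
    acc = [[0.0, 0.0, 0.0] for _ in range(n)]
    # visiting[i] is the (mass, position) currently sitting at element i.
    visiting = [(masses[i], positions[i]) for i in range(n)]
    for _ in range(n - 1):
        # One ring shift: element i receives the state held by element i - 1.
        visiting = [visiting[(i - 1) % n] for i in range(n)]
        # The hardware elements would work in parallel; here we loop.
        for i in range(n):
            m_j, p_j = visiting[i]
            d = [p_j[k] - positions[i][k] for k in range(3)]
            r = math.sqrt(sum(x * x for x in d))
            for k in range(3):
                acc[i][k] += G * m_j * d[k] / r**3
    return acc
```

The outer loop runs n - 1 times, and in each pass all n elements do one
interaction, which is how order n^2 work fits into order n time.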



In 1988, G. Sussman and J. Wisdom used the Orrery to demonstrate that 
the long-term motion of the planet Pluto, and by implication the dynamics of 
the Solar System, is chaotic [3]. This required integrating the positions of the 
outer planets for a simulated time of 845 million years, which required run- 
ning the Orrery continuously for more than three months. Before the Orrery, 
high-precision integrations over simulated millions of years were prohibitively 
expensive, and astrophysicists had done only a few small experiments using 
carefully scheduled resources. 

The objective of our work is to generalize and automate the preparation 
of such computing instruments. Starting from a mathematical description of 
an application — for example the equations of motion of the outer planets — a 
scientist should be able to use the Toolkit to build a modern version of the 
Digital Orrery in about a week of effort, complete with software. With the 
same components, and with a similar amount of effort, an engineer should 
be able to configure a machine, with software, to optimize the design of a 
high-frequency nonlinear circuit such as a phase-locked loop. 

3 Applications 

The ability to easily configure special-purpose hardware opens up a variety of 
important applications that rely upon the ability to perform high-precision 
simulations in real-time or faster than real-time. 

For example, hardware-in-the-loop techniques are used in the development 
of mechanical systems: the design of a mechanical assembly may be 
simplified by instrumenting already-designed physical parts and coupling 
these to actuators driven by simulations of other parts of the assembly. 
Usually this is done with analog or hybrid computers, but special-purpose 
digital systems configured from general components could be cheaper, more 
accurate, and much more flexible. 

In the automatic control of highly nonlinear plants, there are techniques 
that rely upon being able to simulate the dynamics of the plant faster than 
real time, so as to predict the consequences of proposed control actions. Often 
it is desirable to operate a plant close to a point of catastrophic failure. The 
extent to which such control strategies can be safely implemented depends 
upon the quality of the dynamical model of the plant and upon the speed of 
computation available to the control engineer. General-purpose computers 
with physical characteristics appropriate for use in controllers are inadequate 
for this use in all but very slow systems. 

Alternatively, consider the situation of an electrical engineer optimizing 
the design of an important nonlinear circuit, such as an analog-to-digital 
converter. Evaluating each choice of device parameters requires a difficult 
simulation that needs many hours of time on a workstation-class computer. 
Typically, the engineer will run a simulation overnight and adjust the param- 
eters after evaluating the result the next day. It is not uncommon for this 
work to continue for several months. With the Toolkit, the engineer could in- 
stead invest a week's effort, analyzing the problem and evaluating alternative 
Toolkit configurations, to design and configure a special computer to speed 
up the simulation. If each simulation required half a minute rather than 5 
hours, the optimization could be performed using automatic algorithms; in a 
week of continuous running, a program could achieve a better optimum than 
manual methods could ever discover. 
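
The arithmetic behind this comparison is easy to check. The short Python
calculation below uses only the two run times quoted above (5 hours and
half a minute); the variable names are ours.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_WEEK = 7 * 24 * SECONDS_PER_HOUR

workstation_run = 5 * SECONDS_PER_HOUR  # one simulation on a workstation
toolkit_run = 30                        # one simulation at half a minute

speedup = workstation_run / toolkit_run
runs_per_week_toolkit = SECONDS_PER_WEEK // toolkit_run
runs_per_week_workstation = SECONDS_PER_WEEK / workstation_run

print(speedup)                    # 600.0
print(runs_per_week_toolkit)      # 20160
print(runs_per_week_workstation)  # 33.6
```

A week of continuous running thus allows roughly twenty thousand
evaluations, which is enough to drive an automatic optimizer.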

4 Hardware 

The basic Toolkit processor module contains a few arithmetic execution units, 
a small high-speed multiport memory, and a simple controller. In our 
prototype, each processor module may connect to other modules via two 
bidirectional I/O ports, each of which may connect to other units. (Each 
module can connect to about ten others, but we have not yet determined the 
limits here.) All the modules and communication paths of a Toolkit configuration 
are synchronized by a common clock. One can configure any interprocessor 
connection graph, within the fan-out limits, by using a processor for every 
branch, where the interconnections are the nodes (see figure 1). 
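
One way to picture this is as a small data structure, sketched here in
Python. Each processor is recorded as a pair of interconnections (its two
ports), so processors are the branches of the graph and interconnections
are its nodes. The names build_lines, neighbors, and FANOUT_LIMIT are
hypothetical, and applying the limit of ten per interconnection is an
assumption, since the real limit has not been determined.

```python
from collections import defaultdict

FANOUT_LIMIT = 10  # assumed; the text says "about ten", limit not yet determined

def build_lines(config):
    """Map each interconnection (graph node) to the processors attached to it.

    config: dict of processor name -> (left_line, right_line). Each
    processor is a branch joining the two lines its ports plug into.
    """
    lines = defaultdict(set)
    for proc, (left, right) in config.items():
        lines[left].add(proc)
        lines[right].add(proc)
    for line, procs in lines.items():
        if len(procs) > FANOUT_LIMIT:
            raise ValueError(f"line {line!r} exceeds fan-out limit")
    return dict(lines)

def neighbors(config, proc):
    """Processors that share at least one interconnection with proc."""
    lines = build_lines(config)
    left, right = config[proc]
    return (lines[left] | lines[right]) - {proc}

# A 4-processor ring: processor Pi joins line i to line (i + 1) mod 4.
ring = {f"P{i}": (i, (i + 1) % 4) for i in range(4)}
print(sorted(neighbors(ring, "P0")))  # ['P1', 'P3']
```

A cluster in which every processor talks to every other would instead
attach the same line to one port of each member.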

In our prototype, each board has a peak scalar floating-point speed of 28 
double-precision Mflops, and we expect to be able to sustain performance 
of about half this rate (per board) on real problems. The current design is 
constructed from off-the-shelf components, and can be easily duplicated at 
modest cost. 

Figure 2 shows the overall structure of the processor module. 

Our goal in designing this board was to use the fastest floating-point 
chips available and to provide enough bandwidth to keep them fully utilized. 
We chose the two-chip (ALU and multiplier) floating-point chip set made 
by Bipolar Integrated Technologies (B.I.T.) and the fastest easily-available 
memory (20-ns 16Kx4 SRAM). 

Figure 1: Each processor module has two bidirectional I/O ports. The figure shows how 
this allows one to build various network architectures: a mesh, a ring and communicating 
clusters. 

The floating-point unit (FPU) can multiply two 64-bit arguments during 
the time it takes to transfer one word from memory. Thus, our desire to 
obtain a balanced system, in which the FPU is not starved by the memory, 
required that there should be two separate memories. The memories commu- 
nicate with the FPU via a 32-entry register array with 5 ports: a read/write 
port to each memory, two read ports that supply floating-point arguments, 
and a write port for the floating-point result. The register array is config- 
ured from four B.I.T. 5-port 18-bit register-file chips. (This required some 
clever design and a delicate clocking scheme.) All of the data paths in our 
prototype are byte-parity protected. 

Addresses are supplied to each memory by its own address generator, 
which was implemented with a 16-bit wide 2901-style microprocessor. 
Control for the entire processor module is expressed as very long 
instruction words, each 168 bits of horizontal code, stored in a 16K-deep 
microprogram memory. The memory is implemented with the same kind of 16Kx4 
SRAMs that we used for the data memories. The microcode memory is 
addressed using a 16-bit wide 2910-style microprogram sequencer, which also 
provides limited subroutine and branching capabilities. 

Figure 2: This is the overall architecture of the prototype processor module, consisting of 
a fast floating-point chip set, a 5-port register file, two memories and address generators, 
and a sequencer. 

We chose the very long instruction word format because during each cycle 
(about 70 ns) an instruction needs to specify independent operations for the 
multiplier, the ALU, transfers among the registers, memories, and the I/O 
ports, an instruction for each of the address generators, and an operation for 
the sequencer. 
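
The figures quoted in this section can be cross-checked with a few lines of
Python. Two assumptions are made here and are not stated in the text: the
peak rate counts one ALU operation plus one multiply per cycle, and the
microprogram memory carries no extra parity bits.

```python
CYCLE_NS = 70          # "about 70 ns" per instruction cycle
FP_OPS_PER_CYCLE = 2   # assumption: one ALU op plus one multiply per cycle

peak_mflops = FP_OPS_PER_CYCLE / (CYCLE_NS * 1e-9) / 1e6
print(round(peak_mflops, 1))  # 28.6

WORD_BITS = 168        # width of one instruction word
WORDS = 16 * 1024      # depth of the microprogram memory
SRAM_WIDTH = 4         # each 16Kx4 SRAM supplies 4 bits of every word

chips = WORD_BITS // SRAM_WIDTH   # assumption: no parity bits stored
total_bits = WORD_BITS * WORDS
print(chips)                      # 42
print(total_bits / 8 / 1024)      # 336.0 (KiB of microcode)
```

Under these assumptions the result agrees with the quoted peak of about 28
double-precision Mflops per board.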

Figure 3 shows the layout of the prototype processor module on a 13" x 15" 
board, which fits into a standard HP chassis. We expect to assemble a 
machine with 5 to 10 of these boards during the summer of 1990. 

To build a system with several boards we interconnect their I/O ports 
using controlled-impedance transmission lines, terminated at the ends. Each 
port can be used to transmit a 64-bit word between processors in two cycles. 
As there is no hardware arbitration on the I/O ports, the programmer must 
develop a convention for controlling access to each communication channel. 
To prevent bad programs from burning up the drivers, the ports are 
implemented using open-collector TTL transceivers that can drive 
impedances as low as 30 Ohms. 

To avoid reflections the transmission lines are never branched. They 
enter the board on one connector, are routed to transceivers and then exit 
the board on another connector. Careful layout minimizes stubs along the 
interconnect path. The impedance on the board is the same as the impedance 
of the ribbon cable used for interconnect. 1 

Since each board has two I/O ports, rearranging cables permits one to 
statically configure any interconnection scheme (within fanout limits), in 
which each processor may communicate with two distinct sets of neighbors. 
For example, figure 4 shows how one uses this scheme to configure a 4- 
processor cluster. 

The entire machine is intended to be a back-end computer that communi- 
cates with a host computer via a parallel interface. Communication with the 
host is significantly slower than communication between boards. Thus, the 
present prototype is best suited for computations where only a small amount 
of data is transferred between the Toolkit processors and the host. 



1 Henry Wu was instrumental in developing this interconnect technology for our boards. 
Figure 3: Layout of the prototype on a 13" x 15" board. 

Figure 4: Interconnection between modules is accomplished by transmission lines, 
allowing one to statically configure any interconnection network in which each processor is 
connected to at most two nodes. The figure shows how to connect cables to create two 
communicating 4-processor clusters. The boxes marked "V" are terminators. 

5 Software 

5.1 Low-Level Programming Model 

Each Supercomputer Toolkit processor is programmed as a Very Long 
Instruction Word (VLIW) computer. In every cycle, the following operations 
can be performed in parallel (see figure 2): 

* Two memory transactions, one to left memory and one to right memory. 
Each memory can perform a load or store operation with the register file on 
each cycle. 

* Two memory address computations, to generate the addresses that will 
be used to access the memories during the following cycle. The address 
generators have their own internal register files to support these operations. 

* One program-counter operation - conditional branch, jump, call, push/pop, 
etc. 

* One floating-point ALU operation and one floating-point multiply op- 
eration. The ALU and multiplier receive their inputs and store their results 
into the main register-file. 

Both arithmetic chips can be operated simultaneously. However, since 
the ALU and the multiplier share register-file ports, they must not 
require access to the register file in the same cycle. For example, while 
the multiplier is busy with an operation such as square root, which takes 
several cycles to complete, the register-file ports can be used to supply data 
to the ALU. Operations such as multiply-accumulate use internal feedback 
paths within the arithmetic chips, thereby freeing up register-file ports. 

* Each processor has two I/O ports, each of which is connected to a 
communication channel. Two cycles are required to transmit a single 64-bit 
word. Accessing an I/O port uses the internal memory bus for one cycle. 
Thus, a LEFT I/O operation and a LEFT memory operation cannot both 
be performed during the same cycle. 
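
The per-cycle resources listed above can be summarized as a record with one
field per functional unit, together with a check for the two sharing
constraints. The Python sketch below is a simplified, hypothetical
encoding, not the actual 168-bit instruction format. The field names are
ours, and the RIGHT-side rule is assumed by symmetry with the LEFT-side
rule stated in the text.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Instruction:
    """One cycle's worth of operations (hypothetical, simplified encoding)."""
    left_mem: Optional[str] = None    # "load" or "store"
    right_mem: Optional[str] = None
    left_io: Optional[str] = None     # "send" or "receive"
    right_io: Optional[str] = None
    left_addr: Optional[str] = None   # address-generator op for the next cycle
    right_addr: Optional[str] = None
    sequencer: Optional[str] = None   # branch, jump, call, ...
    alu: Optional[str] = None
    mul: Optional[str] = None
    alu_uses_regfile: bool = False
    mul_uses_regfile: bool = False

def conflicts(ins):
    """Return a list of resource conflicts within a single instruction."""
    problems = []
    # An I/O access borrows that side's memory bus for the cycle.
    if ins.left_io and ins.left_mem:
        problems.append("LEFT I/O and LEFT memory in the same cycle")
    if ins.right_io and ins.right_mem:  # assumed symmetric with LEFT
        problems.append("RIGHT I/O and RIGHT memory in the same cycle")
    # The ALU and multiplier share register-file ports.
    if ins.alu and ins.mul and ins.alu_uses_regfile and ins.mul_uses_regfile:
        problems.append("ALU and multiplier both need the register file")
    return problems

ok = Instruction(left_mem="load", right_io="send",
                 alu="add", alu_uses_regfile=True,
                 mul="mac", mul_uses_regfile=False)
bad = Instruction(left_mem="load", left_io="send")
print(conflicts(ok))   # []
print(conflicts(bad))  # ['LEFT I/O and LEFT memory in the same cycle']
```

An assembler or scheduler could run such a check on every instruction word
before emitting it.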

When multiple processors are to be used for a single application, several 
programming styles are possible. The simplest style is to have the program 
counters on all of the boards act in lock-step, effectively forming a multiple 
board VLIW machine. An alternative is to program the processors in a 
MIMD style. In the MIMD style, the processors run totally independent 
programs, exchanging messages via the communication channels as needed. 

To support more complex programming styles that combine aspects of 
both the VLIW and MIMD styles, the hardware provides a wired-or flag for 
synchronizing control among multiple boards. For example, in the integration 
of differential equations, where different state variables have large variations 
in time scale, it is advantageous to use integrators that admit variable and 
individual stepsizes. In such systems, some parts of the process can proceed 
in VLIW fashion, counting out cycles to maintain synchronization, but other 
parts may need explicit synchronization to keep the individual state variable 
integrators in step. 
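
The effect of a wired-or flag can be shown with a toy Python simulation. It
assumes, purely for illustration, that each board asserts the shared flag
while it still has work to do before the synchronization point, and that
all boards wait until the line reads false. The function name and the
cycle-count model are hypothetical.

```python
def run_until_synchronized(work_remaining):
    """Simulate boards sharing one wired-or flag.

    work_remaining: cycles each board still needs before it reaches the
    synchronization point. A board asserts the flag while it is busy; the
    shared line reads true if any board asserts it.
    Returns the number of cycles until every board may proceed.
    """
    remaining = list(work_remaining)
    cycles = 0
    while any(r > 0 for r in remaining):  # the wired-or of all flags
        remaining = [max(r - 1, 0) for r in remaining]
        cycles += 1
    return cycles

print(run_until_synchronized([3, 7, 5]))  # 7
```

Boards that take a fixed number of cycles need no such flag, since they can
simply count; the flag matters when the step counts vary at run time.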

5.2 Compilation 

We intend to automatically compile and schedule high-performance code for 
multiple Toolkit modules and automatically generate an appropriate pattern 
of interconnect, but we have not done that yet. Certainly, the task of pro- 
gramming parallel machines in general is extremely difficult. However, we 
believe that there are special characteristics of common numerical methods 
that make automatic scheduling and network generation feasible for a large 
class of important scientific and engineering applications. 

On the other hand, one can make progress using more modest software 
support. The Orrery was programmed using a fairly simple symbolic mi- 
crocode assembler. This was possible since the solar-system simulation is 
not a very complicated program. The partitioning of the problem into pro- 
cesses, the assignment of these processes to processors, and the programming 
of the connections between processors can be derived from knowledge of the 
problem. 

This kind of low-level programming can be done with the Toolkit now. 
However, we have developed a compiler that automates the process of build- 
ing Orrery-like programs. A user specifies, in a high-level language, the 
straight-line program to be executed in each processor separately. These 
fragments can be manually glued together to allow simple communication 
patterns and to construct loops. 

The compiler, built by Andy Berlin and Bill Rozas, generates efficient 
code by using partial evaluation [4, 5] to "flatten" a program. This produces 
code that contains extremely long straight-line sequences of numerical op- 
erations (often several thousand operations long). This makes it feasible to 
re-order operations to account for pipeline delays, allowing the floating-point 
units to be fully utilized. In addition, this allows data motion instructions, 
such as memory fetches, to be initiated far in advance of the numerical 
operation that needs the data. Work on the Supercomputer Toolkit compiler has 
progressed to the point where we can schedule a solar-system program in 
such a way as to keep one processor fully utilized. We are now working on 
generalizing this approach to schedule code for multiple Toolkit processors. 
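
The flattening idea can be demonstrated with a small Python example. It is
not the Toolkit compiler: it uses operator tracing, a very simple form of
partial evaluation, and the names Tracer, Sym, and dot are hypothetical. An
ordinary routine with a loop over lists is run once on placeholder values.
Because the list lengths are known, the loop and all data-structure
accesses disappear, leaving only a straight-line sequence of arithmetic
operations.

```python
class Tracer:
    """Records arithmetic on placeholder values as straight-line code."""
    def __init__(self):
        self.ops = []
        self.count = 0

    def input(self, name):
        return Sym(self, name)

    def emit(self, op, a, b):
        self.count += 1
        dest = f"t{self.count}"
        self.ops.append((dest, op, a.name, b.name))
        return Sym(self, dest)

class Sym:
    """A placeholder for a number whose value is known only at run time."""
    def __init__(self, tracer, name):
        self.tracer, self.name = tracer, name

    def __add__(self, other):
        return self.tracer.emit("add", self, other)

    def __mul__(self, other):
        return self.tracer.emit("mul", self, other)

def dot(xs, ys):
    """Ordinary high-level code: a loop over data structures."""
    total = xs[0] * ys[0]
    for x, y in zip(xs[1:], ys[1:]):
        total = total + x * y
    return total

t = Tracer()
xs = [t.input(f"x{i}") for i in range(3)]
ys = [t.input(f"y{i}") for i in range(3)]
dot(xs, ys)
for op in t.ops:
    print(op)
# ('t1', 'mul', 'x0', 'y0')
# ('t2', 'mul', 'x1', 'y1')
# ('t3', 'add', 't1', 't2')
# ('t4', 'mul', 'x2', 'y2')
# ('t5', 'add', 't3', 't4')
```

A scheduler is then free to reorder the recorded operations, for example
issuing all three multiplies before the first add, subject only to the data
dependencies visible in the operand names.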

5.3 The Dynamicist's Workbench 

Ultimately we expect the Toolkit to be the workhorse for the Dynamicist's 
Workbench, a tool that will aid scientists and engineers in the simulation 
and analysis of dynamical systems. The Workbench includes a spectrum 
of computational tools — including numerical methods and symbolic algebra. 
These tools are designed so that combined methods, tailored to particular 
problems, can be constructed on the fly. 

For example, one can specify a circuit optimization problem in terms of 
the circuit diagram. One can investigate the dynamics of a double pendulum 
in terms of a Lagrangian that describes it. The Dynamicist's Workbench 
starts with such descriptions and constructs appropriate numerical proce- 
dures for simulations and optimizations. It automatically prepares varia- 
tional equations and sensitivity analysis codes. 

Parts of the programs generated by the Dynamicist's Workbench are 
further compiled by the Toolkit compiler to make microcode for individual 
Toolkit boards. Other parts of the Workbench code will be used to construct 
host-interface software and analysis code to be run in the host. 

6 Summary 

The Toolkit project is not meant to address the difficult issues of large-scale 
parallel computation. Neither the hardware architecture we propose, nor 
the interconnection technology, nor in all likelihood our software ideas can 
be expected to scale to systems with many hundreds of processors. Our 
goal is to realize means, practical within the limits of current technology, to 
provide relatively inexpensive supercomputer performance for a limited, but 
important class of problems in science and engineering. We expect even our 
prototype implementation to be useful for problems modeled with systems 
of ordinary differential equations. Additional Toolkit modules that we hope 
to develop may make other applications feasible, but we have not discussed 
applications here that require large memory or for which an appropriate 
Toolkit configuration would be bigger than a few boards. Simulation of fluid 
flow is one such example. Other examples can be found in [8]. 

Efforts with similar goals include the NuMesh effort at MIT, and the 
iWARP work at CMU [7]. 

There are other promising strategies for parallel computation, represented 
by machines such as the MIT Monsoon Dataflow machine, the Connection 
Machine, the Multiflow computer, and many others. These are general- 
purpose machines. Our idea differs in that we intend to statically configure 
both hardware and software for each particular problem. Thus we require no 
general-purpose software (such as an operating system), no routing protocols, 
and no hardware to support these features. We believe that when attempting 
to obtain maximum performance for a fixed level of technology, we cannot 
afford to pay the price of features intended to support generality. 

As a result of its high performance, relative ease of programming, and low 
cost, we expect the Toolkit to have an impact on scientific and engineering 
computation. 